[Bugfix][V1] Fix allowed_token_ids for v1 Sampler #14169
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: Lu Fang <[email protected]>
Force-pushed from 2439e4a to a5457a6.
LGTM! Thanks for fixing this; I left two minor comments.
vllm/v1/engine/processor.py
Outdated
if params.allowed_token_ids is not None and len(
        params.allowed_token_ids) == 0:
This can be just `if not params.allowed_token_ids`, since if `params.allowed_token_ids` is None it would have already returned above.
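A quick illustration of the Python truthiness this relies on (standalone snippet, not the processor code):

```python
assert bool([]) is False    # an empty list is falsy, so `not` catches it
assert bool([7]) is True    # a non-empty list passes the check
assert bool(None) is False  # None is falsy too, but that case returned earlier
```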
vllm/v1/worker/gpu_input_batch.py
Outdated
- self.allowed_token_ids_mask = torch.zeros(self.max_num_reqs,
-                                            self.vocab_size,
-                                            dtype=torch.bool,
-                                            device=self.device)
- self.allowed_token_ids_mask_cpu_tensor = torch.zeros(
+ self.allowed_token_ids_mask = torch.ones(self.max_num_reqs,
+                                           self.vocab_size,
+                                           dtype=torch.bool,
+                                           device=self.device)
+ self.allowed_token_ids_mask_cpu_tensor = torch.ones(
      self.max_num_reqs,
      self.vocab_size,
      dtype=torch.bool,
      device="cpu")

  self.allowed_token_ids_mask_cpu_tensor[req_index][
-     sampling_params.allowed_token_ids] = True
+     sampling_params.allowed_token_ids] = False
I think the updated logic here is a bit counter-intuitive to readers at a glance - can we add a NOTE here that all token ids with a `True` mask will be filled with `float("-inf")` during sampling?
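For readers skimming the thread, a small standalone sketch of the convention under discussion, using `masked_fill_` as the PR description mentions: positions whose mask value is True get filled with `float("-inf")` before sampling, so only the allowed token ids remain reachable.

```python
import torch

vocab_size = 8
logits = torch.randn(vocab_size)

# True means "disallowed": these positions get filled with -inf at sampling time.
mask = torch.ones(vocab_size, dtype=torch.bool)
allowed_token_ids = [1, 3, 5]
mask[allowed_token_ids] = False  # allowed positions keep their original logits

logits.masked_fill_(mask, float("-inf"))
print(logits)  # only indices 1, 3 and 5 remain finite
```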
yeah, added some comments
Thanks @houseroad, looks good.
vllm/v1/engine/processor.py
Outdated
if params.allowed_token_ids is not None and len(
        params.allowed_token_ids) == 0:
    raise ValueError("allowed_token_ids is not None and empty!")
if not all(0 <= tid < self.model_config.get_vocab_size()
Suggested change:
- if params.allowed_token_ids is not None and len(
-         params.allowed_token_ids) == 0:
-     raise ValueError("allowed_token_ids is not None and empty!")
- if not all(0 <= tid < self.model_config.get_vocab_size()
+ if not params.allowed_token_ids:
+     raise ValueError("allowed_token_ids cannot be empty")
+ vocab_size = self.model_config.get_vocab_size()
+ if not all(0 <= tid < vocab_size
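As a standalone sketch of how the suggested checks behave (hypothetical helper; `vocab_size` stands in for `self.model_config.get_vocab_size()`):

```python
def validate_allowed_token_ids(allowed_token_ids, vocab_size: int) -> None:
    # Hypothetical free-function version of the suggested validation above.
    if allowed_token_ids is None:
        return
    if not allowed_token_ids:
        raise ValueError("allowed_token_ids cannot be empty")
    if not all(0 <= tid < vocab_size for tid in allowed_token_ids):
        raise ValueError("allowed_token_ids must be within the vocabulary")

validate_allowed_token_ids([1, 2, 3], vocab_size=32000)  # passes
# validate_allowed_token_ids([], 32000)    -> ValueError: cannot be empty
# validate_allowed_token_ids([-1], 32000)  -> ValueError: out of vocabulary range
```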
Makes sense, addressed the comments.
Signed-off-by: Lu Fang <[email protected]>
Please check the discussion on Slack regarding `self.allowed_token_ids_mask` - we need to take into account the situation where requests have different sampling params, which is fairly common in the online setting.
@liangfu yes, thanks to @robertgshaw2-redhat for pointing out that we only want to invert the mask for the applicable rows. So the tensors should still be initialized to zeros / reset to zeros when a request is removed, and when adding a request, just fill its row with ones before setting the allowed tokens to zero.
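A minimal standalone sketch of that per-row handling (plain tensors and free functions; the real logic lives in `gpu_input_batch.py`, so names here are illustrative): the full mask stays all-zeros, a row is flipped to ones only when its request actually carries `allowed_token_ids`, and the allowed positions in that row are then cleared.

```python
import torch

max_num_reqs, vocab_size = 4, 8

# Initialized to zeros: a row that never gets allowed_token_ids disallows nothing.
allowed_token_ids_mask_cpu = torch.zeros(max_num_reqs, vocab_size, dtype=torch.bool)

def add_request(req_index: int, allowed_token_ids) -> None:
    if allowed_token_ids:
        # Invert only this row: start with everything disallowed (True),
        # then clear the positions that are actually allowed.
        allowed_token_ids_mask_cpu[req_index] = True
        allowed_token_ids_mask_cpu[req_index, allowed_token_ids] = False

def remove_request(req_index: int) -> None:
    # Reset the row to zeros so a reused slot is unrestricted by default.
    allowed_token_ids_mask_cpu[req_index].fill_(False)

add_request(0, [1, 3])  # row 0: True everywhere except columns 1 and 3
add_request(1, None)    # row 1: stays all False (no restriction)
remove_request(0)       # row 0: back to all False
```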
…ctly Signed-off-by: Lu Fang <[email protected]>
vllm/v1/worker/gpu_input_batch.py
Outdated
@@ -359,7 +362,7 @@ def remove_request(self, req_id: str) -> Optional[int]:
      self.logit_bias[req_index] = None
      self.has_allowed_token_ids.discard(req_id)
      if self.allowed_token_ids_mask_cpu_tensor is not None:
-         self.allowed_token_ids_mask_cpu_tensor[req_index].fill_(False)
+         self.allowed_token_ids_mask_cpu_tensor[req_index].fill_(True)
@houseroad this should also be reverted, right?
True, added more comments to help with understanding.
…ctly Signed-off-by: Lu Fang <[email protected]>
Thanks @houseroad, LGTM now
Thanks for the fix!
This was missed when merging vllm-project#14169 and vllm-project#14159 Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Lu Fang <[email protected]> Signed-off-by: Johnny <[email protected]>
Revert the ones and zeros so that `masked_fill_` is applied appropriately.
Tested with 'test_allowed_token_ids' in #14159
Additional test:
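The additional test itself is not included above. Purely as an illustration of the behaviour being verified (a hypothetical check, not the author's actual test), a self-contained version of the mask convention could look like:

```python
import torch

def test_allowed_token_ids_mask_sketch():
    # Hypothetical illustration of the mask convention, not the PR's test.
    vocab_size = 16
    allowed = [2, 5, 7]
    logits = torch.randn(vocab_size)

    mask = torch.ones(vocab_size, dtype=torch.bool)
    mask[allowed] = False
    logits.masked_fill_(mask, float("-inf"))

    # Greedy sampling can now only ever pick an allowed token id.
    assert int(torch.argmax(logits)) in allowed

test_allowed_token_ids_mask_sketch()
```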